This analysis examines simulated income data for Hungary, focusing on the relationships between income and various demographic factors including age, location, occupation, and gender. The dataset is simulated to reflect realistic patterns while maintaining a manageable size for analysis.
The data was simulated by the data_simulation.R script.
The data is available in the hungarian_income_data.csv
file.
Important things to note about the generated data:
Only the 8 most populated cities of Hungary are taken into count weighted by their population. List of cities: Budapest, Debrecen, Szeged, Miskolc, Pécs, Győr, Szombathely, Eger.
Only the 10 most common occupations are taken into count weighted by their frequency in the workforce. List of occupations: Software Developer, Teacher, Doctor, Sales Representative, Engineer, Accountant, Nurse, Manager, Chef, Driver.
The age distribution is generated by a beta distribution with parameters \(\alpha = 2\) and \(\beta = 3\) and multiplied by 95 to put the end result in the desired range. The beta distribution with the aforementioned parameters skews the age distribution towards younger ages, which is more realistic.
There are three groups of people categorized by their age:
Underage: each person has a random age at which they start working between 14 and 24.
Working age: 19-67
Pension age: each person has a random retirement age between 60 and 75.
Under 18 people have no income.
Working age people have a regular income based on their age, occupation, city, and gender.
Pension age people have a pension based on their occupation and city.
All working age people are considered to be employed.
The income of a working age man is 20.000 HUF higher than the income of a working age woman in the same occupation, city, and age group.
summary_stats <- summary(data)
kable(summary_stats, caption = "Summary Statistics of the Dataset") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| age | city | occupation | gender | income | starting_age | retirement_age | |
|---|---|---|---|---|---|---|---|
| Min. : 0.00 | Length:10000 | Length:10000 | Length:10000 | Min. : 0 | Min. :14.00 | Min. :60.00 | |
| 1st Qu.:23.00 | Class :character | Class :character | Class :character | 1st Qu.: 74373 | 1st Qu.:18.00 | 1st Qu.:65.00 | |
| Median :36.00 | Mode :character | Mode :character | Mode :character | Median :320922 | Median :19.00 | Median :67.00 | |
| Mean :37.87 | NA | NA | NA | Mean :296226 | Mean :18.99 | Mean :66.99 | |
| 3rd Qu.:51.00 | NA | NA | NA | 3rd Qu.:469238 | 3rd Qu.:20.00 | 3rd Qu.:69.00 | |
| Max. :94.00 | NA | NA | NA | Max. :856706 | Max. :24.00 | Max. :75.00 |
##
## City Distribution:
##
## Budapest Debrecen Eger Győr Miskolc Pécs
## 3468 1543 658 806 974 853
## Szeged Szombathely
## 1160 538
##
## Occupation Distribution:
##
## Accountant Chef Doctor
## 1190 507 492
## Driver Engineer Manager
## 533 980 775
## Nurse Sales Representative Software Developer
## 1209 2063 772
## Teacher
## 1479
##
## Age Distribution Summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 23.00 36.00 37.87 51.00 94.00
##
## Retirement Age Distribution:
##
## 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 139 201 347 549 829 1079 1187 1339 1247 1044 841 526 346 177 79 70
##
## Age Group Distribution:
##
## 0-14 15-64 65+
## 1305 7774 920
##
## Income Distribution Summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 74373 320922 296226 469238 856706
##
## Income Distribution by Age Group:
print(tapply(data$income, cut(data$age, breaks = c(0, 18, 65, Inf), labels = c("Under 18", "Working Age", "Pension Age")), summary)) ## $`Under 18`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 9084 0 374737
##
## $`Working Age`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 275889 394134 385521 500014 856706
##
## $`Pension Age`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 38531 59379 145714 104985 820253
library(forcats)
data <- data %>%
mutate(age_group = cut(age,
breaks = seq(0, 100, by = 2),
right = FALSE,
include.lowest = TRUE,
labels = seq(0, 98, by = 2)))
dem_pyramid <- data %>%
group_by(age_group, gender) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(count = ifelse(gender == "Male", -count, count))
ggplot(dem_pyramid, aes(x = age_group, y = count, fill = gender)) +
geom_bar(stat = "identity", width = 0.8, color = "black") +
scale_y_continuous(labels = abs, expand = expansion(mult = c(0.05, 0.05))) +
scale_fill_manual(values = c("Male" = "#00BFFF", "Female" = "#FF3B3B")) +
coord_flip() +
labs(title = "Population Pyramid of Simulated Hungarian Data",
x = "Age Group",
y = "Count",
fill = "Gender") +
custom_theme +
theme(legend.position = "top",
axis.text.y = element_text(size = 10, face = "bold"),
plot.margin = margin(t = 20, r = 20, b = 20, l = 20))ggplot(data %>% filter(age >= 18), aes(x = income, fill = gender)) +
geom_density(alpha = 0.6) +
scale_fill_viridis_d() +
labs(title = "Income Distribution by Gender",
subtitle = "(working age only)",
x = "Income (HUF)",
y = "Density") +
custom_themeggplot(data, aes(x = age, y = income, color = gender)) +
geom_point(alpha = 0.1, width = 0.2) +
scale_color_viridis_d() +
labs(title = "Income Distribution by Age",
subtitle = "(working age only)",
x = "Age",
y = "Income (HUF)") +
custom_themeggplot(data %>% filter(age >= 18), aes(x = reorder(occupation, income, FUN = median), y = income, color = occupation)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(alpha = 0.1, width = 0.2) +
scale_color_viridis_d() +
coord_flip() +
labs(title = "Income Distribution by Occupation",
subtitle = "(working age only)",
x = "Occupation",
y = "Income (HUF)") +
custom_themeggplot(data %>% filter(age >= 18), aes(x = reorder(city, income, FUN = median), y = income, fill = city)) +
geom_violin(alpha = 0.7) +
geom_boxplot(width = 0.2, alpha = 0.5) +
scale_fill_viridis_d() +
coord_flip() +
labs(title = "Income Distribution by City",
subtitle = "(working age only)",
x = "City",
y = "Income (HUF)") +
custom_themeggplot(data %>% filter(age >= 18), aes(x = age, y = income, color = gender)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "loess", se = TRUE) +
scale_color_viridis_d() +
labs(title = "Relationship between Age and Income",
subtitle = "(working age only)",
x = "Age",
y = "Income (HUF)") +
custom_themeincome_by_category <- data %>%
filter(age >= 18) %>%
group_by(occupation, city, gender) %>%
summarise(
mean_income = mean(income),
count = n(),
.groups = 'drop'
) %>%
arrange(desc(mean_income))
# heatmap
ggplot(income_by_category, aes(x = city, y = occupation, fill = mean_income)) +
geom_tile() +
scale_fill_viridis(name = "Mean Income (HUF)") +
facet_wrap(~gender) +
labs(title = "Mean Income by Occupation, City, and Gender",
subtitle = "(working age only)",
x = "City",
y = "Occupation") +
custom_theme +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold"))top_earners <- income_by_category %>%
arrange(desc(mean_income)) %>%
head(10)
kable(top_earners,
caption = "Top 10 Highest Earning Combinations",
digits = 0) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| occupation | city | gender | mean_income | count |
|---|---|---|---|---|
| Software Developer | Budapest | Female | 555020 | 104 |
| Software Developer | Budapest | Male | 499488 | 113 |
| Software Developer | Debrecen | Male | 494909 | 45 |
| Doctor | Budapest | Male | 486931 | 70 |
| Software Developer | Szeged | Female | 481669 | 46 |
| Manager | Szeged | Male | 478305 | 32 |
| Manager | Budapest | Male | 467616 | 122 |
| Software Developer | Szeged | Male | 447791 | 39 |
| Doctor | Budapest | Female | 443706 | 91 |
| Software Developer | Miskolc | Female | 443251 | 28 |
# Test if there's a significant difference in income between genders
t_test_result <- t.test(income ~ gender, data = data)
print(t_test_result)##
## Welch Two Sample t-test
##
## data: income by gender
## t = -3.7426, df = 9996.3, p-value = 0.0001832
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -24073.503 -7524.036
## sample estimates:
## mean in group Female mean in group Male
## 288268.7 304067.4
# Test if there are significant differences in income between cities
city_anova <- aov(income ~ city, data = data)
print(summary(city_anova))## Df Sum Sq Mean Sq F value Pr(>F)
## city 7 1.182e+13 1.688e+12 38.83 <2e-16 ***
## Residuals 9992 4.344e+14 4.347e+10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Test if there are significant differences in income between occupations
occupation_anova <- aov(income ~ occupation, data = data)
print(summary(occupation_anova))## Df Sum Sq Mean Sq F value Pr(>F)
## occupation 9 1.529e+13 1.698e+12 39.38 <2e-16 ***
## Residuals 9990 4.309e+14 4.313e+10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Test if income distributions differ between genders
male_income <- data$income[data$gender == "Male"]
female_income <- data$income[data$gender == "Female"]
ks_test <- ks.test(male_income, female_income)
print(ks_test)##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: male_income and female_income
## D = 0.046334, p-value = 4.361e-05
## alternative hypothesis: two-sided
# Test if income distributions differ between cities
kruskal_test <- kruskal.test(income ~ city, data = data)
print(kruskal_test)##
## Kruskal-Wallis rank sum test
##
## data: income by city
## Kruskal-Wallis chi-squared = 263.84, df = 7, p-value < 2.2e-16
# Convert categorical variables to factors
data_working_age <- data %>% filter(age >= 18)
data_working_age$city <- as.factor(data_working_age$city)
data_working_age$occupation <- as.factor(data_working_age$occupation)
data_working_age$gender <- as.factor(data_working_age$gender)
# Fit the model
model1 <- lm(income ~ age + I(age^2) + city + occupation + gender, data = data_working_age)
# Create a beautiful model summary table
model_summary <- summary(model1)
kable(tidy(model_summary), caption = "Multiple Linear Regression Results (Working Age Only)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -564005.7039 | 8854.984102 | -63.693587 | 0.0000000 |
| age | 43984.4061 | 367.359676 | 119.731176 | 0.0000000 |
| I(age^2) | -440.9603 | 3.838656 | -114.873631 | 0.0000000 |
| cityDebrecen | -45773.0706 | 3463.020593 | -13.217672 | 0.0000000 |
| cityEger | -93940.7010 | 4782.185556 | -19.643885 | 0.0000000 |
| cityGyőr | -89210.9797 | 4445.406543 | -20.068126 | 0.0000000 |
| cityMiskolc | -96508.6864 | 4148.656395 | -23.262637 | 0.0000000 |
| cityPécs | -89476.1672 | 4352.994886 | -20.555082 | 0.0000000 |
| citySzeged | -41476.6860 | 3898.003624 | -10.640495 | 0.0000000 |
| citySzombathely | -90903.1668 | 5287.112689 | -17.193348 | 0.0000000 |
| occupationChef | -27765.9163 | 6017.900655 | -4.613887 | 0.0000040 |
| occupationDoctor | 59233.3095 | 6080.352617 | 9.741756 | 0.0000000 |
| occupationDriver | -45601.3129 | 5994.714141 | -7.606920 | 0.0000000 |
| occupationEngineer | 22666.2477 | 4893.333483 | 4.632067 | 0.0000037 |
| occupationManager | 44978.3699 | 5268.892985 | 8.536588 | 0.0000000 |
| occupationNurse | -23498.2703 | 4629.489927 | -5.075780 | 0.0000004 |
| occupationSales Representative | -52814.4425 | 4119.434346 | -12.820800 | 0.0000000 |
| occupationSoftware Developer | 93523.3378 | 5242.671238 | 17.838871 | 0.0000000 |
| occupationTeacher | -13507.9435 | 4414.906643 | -3.059621 | 0.0022232 |
| genderMale | 18514.3476 | 2276.242924 | 8.133731 | 0.0000000 |
# Enhanced model diagnostics
par(mfrow = c(2,2))
plot(model1, col = income_palette[1], pch = 19, cex = 0.7)# Fit polynomial regression
model2 <- lm(income ~ poly(age, 3), data = data_working_age)
# Create a static plot
ggplot(data_working_age, aes(x = age, y = income)) +
geom_point(alpha = 0.1, color = income_palette[1]) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3),
color = income_palette[5], fill = income_palette[5], alpha = 0.2) +
labs(title = "Polynomial Regression: Age vs Income",
subtitle = "Cubic polynomial fit with confidence interval (Working Age Only)",
x = "Age",
y = "Income (HUF)") +
custom_theme# Model comparison
model_comparison <- data.frame(
Model = c("Multiple Linear", "Polynomial"),
R_squared = c(summary(model1)$r.squared, summary(model2)$r.squared),
Adj_R_squared = c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared)
)
kable(model_comparison, caption = "Model Comparison (Working Age Only)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Model | R_squared | Adj_R_squared |
|---|---|---|
| Multiple Linear | 0.6765319 | 0.6757960 |
| Polynomial | 0.5780235 | 0.5778722 |
# Create a sample prediction
new_data <- data.frame(
age = 35,
city = "Budapest",
occupation = "Software Developer",
gender = "Male"
)
# Predict income
prediction <- predict(model1, newdata = new_data, interval = "prediction")
print(prediction)## fit lwr upr
## 1 547309.8 343154.5 751465.1
The analysis of the simulated Hungarian income data reveals several interesting patterns:
The analysis demonstrates the complex interplay between various factors affecting income levels in Hungary. The findings suggest that both demographic factors (age, gender) and professional factors (occupation, location) play important roles in determining income levels.